This paper compares eight commercial parallel processors along several dimensions. The
processors include four shared-bus multiprocessors (the Encore Multimax, the Sequent
Balance system, the Alliant FX series, and the ELXSI System 6400) and four network
multiprocessors (the BBN Butterfly, the NCUBE, the Intel iPSC/2, and the FPS T Series).
The paper contrasts the computers from the standpoint of interconnection structures, 

We present a survey of the state-of-the-art techniques used in performing data and
memory-related optimizations in embedded systems. The optimizations are targeted 
directly or indirectly at the memory subsystem, and impact one or more out of three 
important cost metrics: area, performance, and power dissipation of the resulting 
implementation. We first examine architecture-independent optimizations in the form of 
code transformations. We next cover a broad spectrum of optimizati ...

Keywords: DRAM, SRAM, address generation, allocation, architecture exploration, code 
transformation, data cache, data optimization, high-level synthesis, memory architecture 

Efficient utilization of on-chip memory space is extremely important in modern embedded
system applications based on processor cores. In addition to a data cache that interfaces 
with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often 
used in several applications, so that critical data can be stored there with a guaranteed 
fast access time. We present a technique for efficiently exploiting on-chip Scratch-Pad 
memory by partitioning the application's scalar and a ... 

Keywords: data cache, data partitioning, memory synthesis, on-chip memory, scratch-pad
memory, system design, system synthesis

In an effort to push the envelope of system performance, microprocessor designs are
continually exploiting higher levels of instruction-level parallelism, resulting in increasing
bandwidth demands on the address translation mechanism. Most current microprocessor
designs meet this demand with a multi-ported TLB. While this design provides an excellent 
hit rate at each port, its access latency and area grow very quickly as the number of ports 
is increased. As bandwidth demands continue to increase 

A cost effective architecture for vectorizable numerical and multimedia applications

Francisca Quintana, Jesus Corbal, Roger Espasa, Mateo Valero 

July 2001 Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Publisher: ACM Press 

Full text available: pdf(293.82 KB) Additional Information: full citation, abstract, references, index terms

This paper analyzes the performance of vector-dominated regions of code in numerical and 
multimedia applications in a superscalar+vector architecture and compares it to an 8-way 
superscalar processor. The ability to split a program's execution into scalar and vector 
regions allows us to show that (1) as expected, the vector unit is much better than the 
wide issue superscalar at executing the vector-dominated regions of the code; (2) on the 
scalar regions, the 8-way superscalar, although bette ... 

Publisher: IEEE Computer Society Press, ACM Press
Full text available: pdf(1.59 MB)

Very Long Instruction Word (VLIW) architectures were promised to deliver far more than
the factor of two or three that current architectures achieve from overlapped execution.
Using a new type of compiler which compacts ordinary sequential code into long
instruction words, a VLIW machine was expected to provide from ten to thirty times the
performance of a more conventional machine built of the same implementation
technology. Multiflow Computer, Inc., has now built a VLIW called the TRACE™



High-performance multi-queue buffers for VLSI communications switches
Y. Tamir, G. L. Frazier 

May 1988 ACM SIGARCH Computer Architecture News, Proceedings of the 15th
Annual International Symposium on Computer architecture ISCA '88,
Full text available: pdf(1.41 MB) Additional Information: full citation, abstract, references, citings, index terms

Small n×n switches are key components of multistage interconnection networks used in
multiprocessors as well as in the communication coprocessors used in multicomputers. The
design of the internal buffers in these switches is of critical importance for achieving
high-throughput, low-latency communication. We discuss several buffer structures and compare
them in terms of implementation complexity and their ability to deal with variations in 
traffic patterns a ... 

T. N. Mudge, B. A. Makrucki

April 1982 Proceedings of the 9th annual symposium on Computer Architecture 
Publisher: IEEE Computer Society Press 

This paper presents a probabilistic analysis of a crossbar switch interconnection network. A 
crossbar switch can be used to interconnect various combinations of computer
subsystems. In the analysis below it is assumed, without loss of generality, that the
crossbar is being used to connect N processors to M memories. The crossbar is termed an
N-M crossbar (read "N to M crossbar"). General expressions are developed for a variety of
performance fig ... 
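
To make the setting concrete, the following is a minimal sketch of one textbook-style performance figure for an N-M crossbar, not the paper's own derivation. It assumes (these assumptions are not stated in the abstract) that every processor issues exactly one request per cycle, that each request targets a memory chosen uniformly and independently, and that rejected requests are discarded. Under those assumptions the expected number of busy memories per cycle is M(1 - (1 - 1/M)^N). The function names are hypothetical.

```python
import random


def expected_busy_memories(n_processors: int, n_memories: int) -> float:
    """Closed form under the stated assumptions: a given memory is idle
    only if all N processors pick one of the other M - 1 memories."""
    idle_probability = (1.0 - 1.0 / n_memories) ** n_processors
    return n_memories * (1.0 - idle_probability)


def simulate_busy_memories(n_processors: int, n_memories: int,
                           cycles: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total_busy = 0
    for _ in range(cycles):
        # Each processor picks one memory; a memory is busy if anyone picked it.
        requested = {rng.randrange(n_memories) for _ in range(n_processors)}
        total_busy += len(requested)
    return total_busy / cycles


if __name__ == "__main__":
    n, m = 8, 8
    print(expected_busy_memories(n, m), simulate_busy_memories(n, m))
```

The simulation is only a cross-check on the closed form; relaxing any of the assumptions (for example, resubmitting rejected requests) changes the result.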



Half-price architecture
Ilhyun Kim, Mikko H. Lipasti 

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture ISCA '03, volume 31 Issue 2
Publisher: ACM Press

Full text available: pdf(278.61 KB) Additional Information: full citation, abstract, references, citings

Current-generation microprocessors are designed to process instructions with one and two 
source operands at equal cost. Handling two source operands requires multiple ports for 
each instruction in structures, such as the register file and wakeup logic, which are often
in the processor's critical timing paths. We argue that these structures are overdesigned 
since only a small fraction of instructions require two source operands to be processed 
simultaneously. In this paper, we propose the half-pr ... 

Design space exploration for 3D architectures
Yuan Xie, Gabriel H. Loh, Bryan Black, Kerry Bernstein 

April 2006 ACM Journal on Emerging Technologies in Computing Systems (JETC), Volume 2 Issue 2
Publisher: ACM Press

Full text available: pdf(1.93 MB) Additional Information: full citation, abstract, references, index terms

As technology scales, interconnects have become a major performance bottleneck and a 
major source of power consumption for microprocessors. Increasing interconnect costs 
make it necessary to consider alternate ways of building modern microprocessors. One 
promising option is 3D architectures where a stack of multiple device layers with direct
vertical tunneling through them is put together on the same chip. As fabrication of 3D
integrated circuits has become viable, developing CAD tools and arch ... 

Keywords: 3D integration, hardware, microarchitecture, processor architectures 



Trace cache: a low latency approach to high bandwidth instruction fetching
Eric Rotenberg, Steve Bennett, James E. Smith 

December 1996 Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Publisher: IEEE Computer Society

Full text available: pdf(1.38 MB) Additional Information: full citation, abstract, references, citings, index terms

As the issue width of superscalar processors is increased, instruction fetch bandwidth 
requirements will also increase. It will become necessary to fetch multiple basic blocks per 
cycle. Conventional instruction caches hinder this effort because long instruction 
sequences are not always in contiguous cache locations. We propose supplementing the 
conventional instruction cache with a trace cache. This structure caches traces of the 
dynamic instruction stream, so instructions that are otherwise no ... 

Keywords: instruction cache, instruction fetching, multiple branch prediction, superscalar 
processors, trace cache 
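
The idea in the abstract above, caching traces of the dynamic instruction stream so that several basic blocks can be fetched at once, can be illustrated with a minimal sketch. This is a simplified model under stated assumptions, not the authors' design: the class and method names are hypothetical, at most one trace is stored per start address, capacity is unbounded, and a miss (which would fall back to the conventional instruction cache) is only signalled, not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Trace:
    start_pc: int                       # address of the first instruction
    branch_outcomes: Tuple[bool, ...]   # taken / not taken for each branch inside the trace
    instructions: List[str]             # instructions in dynamic (executed) order


class TraceCache:
    """Toy trace cache: one trace per start address (a simplifying assumption)."""

    def __init__(self) -> None:
        self._lines: Dict[int, Trace] = {}

    def fill(self, trace: Trace) -> None:
        # Record a trace observed in the dynamic instruction stream.
        self._lines[trace.start_pc] = trace

    def fetch(self, pc: int, predicted: Sequence[bool]) -> Optional[List[str]]:
        """Return the whole trace on a hit, or None on a miss.

        A hit requires a trace starting at pc whose recorded branch
        outcomes match the leading branch predictions.
        """
        trace = self._lines.get(pc)
        if trace is None:
            return None
        needed = len(trace.branch_outcomes)
        if tuple(predicted[:needed]) != trace.branch_outcomes:
            return None
        return trace.instructions


if __name__ == "__main__":
    cache = TraceCache()
    cache.fill(Trace(0x100, (True, False), ["add", "beq", "sub", "bne", "mul"]))
    print(cache.fetch(0x100, [True, False]))   # hit: instructions from several basic blocks
    print(cache.fetch(0x100, [False, False]))  # miss: predictions disagree with the stored trace
```

On a hit, instructions from non-contiguous basic blocks are delivered together, which is the property the abstract contrasts with a conventional instruction cache.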






